BiMr - Bird Migration Tool

Scholarly HTML Technical Report

  1. Iulian Gîlcă
  2. Rareș Enache
  3. Răzvan Foca
  1. Dr. Sabin-Corneliu Buraga

Abstract

Using the tweets posted by Twitter users and the data feed provided by the eBird API, our application generates an interactive, real-time map that represents the current tendency of birds to move from one place to another. This makes it possible to see whether a massive migration flow is under way.

Motivation

The idea for this project came from the fact that there are few accessible applications that let users see detailed information about bird migration around the world. Such an instrument can be useful in many fields, especially ornithology, although anyone, even without a professional background in the field, can draw conclusions from a map, updated in real time, that illustrates this phenomenon.

Twitter is an online news and social networking application where users post and read short messages called "tweets" (originally limited to 140 characters). Besides the content of the tweet, users may also make their location available, so that readers know where the author was when the tweet was posted.

eBird API 1.1 is a public API created by Cornell University, a private and statutory Ivy League research university located in Ithaca, New York. For our application, it provides a data feed, updated in real time, about bird migration. For example, it can offer information such as summaries of hotspot sightings and recent notable observations at hotspots, at locations, in regions, or of a certain species. Each of these categories also includes the name of the sighted species (both common and scientific), the geographical coordinates, the name of the person who made the observation, etc.

Our challenge is to build a technology that collects, processes and displays data, all in real time. All data managed by the Web application must be described in RDF, based on existing knowledge models. In this way, we will be able to visualize the migration of birds in a user-friendly and intuitive manner through a web interface updated in real time.

Technologies

The main technologies we'll use in this project are :

Our proposal

Our proposal is meant to offer a scalable solution for obtaining and analysing, in real time, tweets and the eBird API data feed concerning the aforementioned topic. For this, we will develop our application based on the following architecture:

Figure 1. Architecture model

Architecture (Modules)

After studying the architecture shown above, we have come up with the following four main components of our application:

Figure 2. Communication between modules

Data collector

First of all, we need a continuous stream of data from Twitter and eBird. We will implement a module that collects data from the Twitter Public Streaming API and the eBird API. For the Twitter data we will use twitter4j, a Java library for retrieving tweets. twitter4j needs Twitter credentials to pull tweets, and we obtain them by creating a new app on Twitter. This does not require users to provide their own credentials; for each created application, Twitter provides secrets used to authenticate via OAuth.
A Job Scheduler will call the eBird API every 24 hours and collect all the new data. The eBird API does not require any kind of authentication or key and can be used at any time. For Twitter, the Job Scheduler will make calls every 15 minutes (a constraint imposed by Twitter).
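
The two collection intervals described above can be sketched with the standard java.util.concurrent scheduler. This is only an illustration under assumptions: the class name CollectorScheduler and the placeholder jobs are hypothetical, and the report does not say which scheduling mechanism the application actually uses.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch of a job scheduler running two collectors at different intervals. */
class CollectorScheduler {
    // Intervals stated in the text: eBird every 24 hours, Twitter every 15 minutes.
    static final Duration EBIRD_INTERVAL = Duration.ofHours(24);
    static final Duration TWITTER_INTERVAL = Duration.ofMinutes(15);

    private final ScheduledExecutorService executor = Executors.newScheduledThreadPool(2);

    /** Runs each job once right away, then repeats it at its own interval. */
    void start(Runnable ebirdJob, Runnable twitterJob) {
        schedule(ebirdJob, EBIRD_INTERVAL);
        schedule(twitterJob, TWITTER_INTERVAL);
    }

    private void schedule(Runnable job, Duration interval) {
        // scheduleAtFixedRate suppresses later runs if the job throws,
        // so a real collector should catch and log its own exceptions.
        executor.scheduleAtFixedRate(job, 0, interval.toMillis(), TimeUnit.MILLISECONDS);
    }

    /** Stops both periodic jobs. */
    void stop() {
        executor.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        CollectorScheduler scheduler = new CollectorScheduler();
        // Placeholder jobs; the real ones would call the eBird API and Twitter.
        scheduler.start(
                () -> System.out.println("collecting from eBird"),
                () -> System.out.println("collecting from Twitter"));
        Thread.sleep(200); // give the first runs time to execute
        scheduler.stop();
    }
}
```

Keeping the intervals as named constants makes the Twitter constraint easy to adjust if the rate limit changes.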

Figure 3. BPMN Diagram

Data processing

Now that we have a continuous stream of fresh data, we can filter and process tweets. We keep only tweets that have a relevant hashtag (e.g. #birdMigration, #birds, #birdsmig) and come with a place or a specified location. Many users now disable the location service on their phones, so very few tweets include a location. The tweet must also include the common name or the scientific name of the bird species. To select the tweets that meet these criteria, we analyse each tweet's content with the Stanford NLP TokensRegex framework, defining patterns over text that match the description of a location or a bird. For data provided by the eBird API things are much simpler, because the API provides all the information our application needs.
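
The three filtering criteria above (tracked hashtag, attached location, species name) can be illustrated with a small sketch. It deliberately uses plain java.util.regex and a substring check instead of Stanford TokensRegex, so it is a simplified stand-in rather than the application's method. The Tweet class and the short species list are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of the tweet filter: hashtag, location and species name must all be present. */
class TweetFilter {
    /** Hypothetical minimal view of a tweet; place is null when no location is attached. */
    static final class Tweet {
        final String text;
        final String place;

        Tweet(String text, String place) {
            this.text = text;
            this.place = place;
        }
    }

    // Hashtags named in the text, lower-cased for case-insensitive comparison.
    private static final Set<String> HASHTAGS = Set.of("#birdmigration", "#birds", "#birdsmig");
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");

    // Illustrative names only; a real list would come from a bird taxonomy.
    private static final List<String> SPECIES = List.of("eagle", "osprey", "killdeer", "hawk");

    static boolean hasTrackedHashtag(String text) {
        Matcher m = HASHTAG.matcher(text);
        while (m.find()) {
            if (HASHTAGS.contains(m.group().toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    /** Naive substring match; it also accepts plurals such as "eagles". */
    static boolean mentionsSpecies(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        return SPECIES.stream().anyMatch(lower::contains);
    }

    static boolean accept(Tweet tweet) {
        return tweet.place != null && !tweet.place.isBlank()
                && hasTrackedHashtag(tweet.text)
                && mentionsSpecies(tweet.text);
    }

    public static void main(String[] args) {
        Tweet tweet = new Tweet("Two eagles over the lake #birdMigration", "Iasi, Romania");
        System.out.println("accepted: " + accept(tweet));
    }
}
```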

To find out whether a word represents the name of a bird species or another entity of interest (such as a person, link, location or author), we manually trained a model. The output below shows the classifier in action. Some of its assignments are clearly wrong, because we need many more (manually) annotated tweets to obtain good results. By our calculations, this classifier reaches a success rate of 65%.

[hawk] pos: [NN] ne: [BISP]
[today] pos: [NN] ne: [TIME]
[winter] pos: [NN] ne: [DATE]
[hawk] pos: [NN] ne: [BISP]
[hawk] pos: [NN] ne: [BISP]
[White-throated] pos: [JJ] ne: [BISP]
[Osprey] pos: [NNP] ne: [BISP]
[Osprey] pos: [NNP] ne: [BISP]
[Osprey] pos: [NNP] ne: [BISP]
[Whopper] pos: [NNP] ne: [BISP]
[Osprey] pos: [NNP] ne: [BISP]
[Washington] pos: [NNP] ne: [LOC]
[today] pos: [NN] ne: [TIME]
[!!!] pos: [CD] ne: [NUMBER]
[Osprey] pos: [NNP] ne: [BISP]
[Osprey] pos: [NNP] ne: [BISP]
[Osprey] pos: [NNP] ne: [BISP]
[Urbanized] pos: [JJ] ne: [BISP]
[Eagle] pos: [NNP] ne: [BISP]
[Eagle] pos: [NNP] ne: [BISP]
[Friday] pos: [NNP] ne: [BISP]
[eagles] pos: [NNS] ne: [BISP]
[today] pos: [NN] ne: [TIME]
[Eagle] pos: [NNP] ne: [BISP]
[???] pos: [CD] ne: [NUMBER]
[Tufted] pos: [JJ] ne: [BISP]
[Eagle] pos: [NNP] ne: [BISP]
[Tricolored] pos: [JJ] ne: [BISP]
[eagle] pos: [NN] ne: [BISP]
[Eagle] pos: [NNP] ne: [BISP]
[Eagle] pos: [NNP] ne: [BISP]
[Eagles] pos: [NNPS] ne: [BISP]
[red-tailed] pos: [JJ] ne: [BISP]
[Eagle] pos: [NNP] ne: [BISP]
[6] pos: [CD] ne: [NUMBER]
[Killdeer] pos: [NNP] ne: [BISP]
[this] pos: [DT] ne: [TIME]
[morning] pos: [NN] ne: [TIME]
[one] pos: [CD] ne: [NUMBER]
[4] pos: [CD] ne: [NUMBER]
[Killdeer] pos: [NN] ne: [BISP]
[Killdeer] pos: [NNP] ne: [BISP]
[day] pos: [NN] ne: [DURATION]
[Killdeer] pos: [NNP] ne: [BISP]
[Killdeers] pos: [NNS] ne: [BISP]
[winter] pos: [NN] ne: [DATE]
[months] pos: [NNS] ne: [DURATION]
[Killdeer] pos: [NN] ne: [BISP]
[Killdeer] pos: [NNP] ne: [BISP]
[one] pos: [CD] ne: [NUMBER]
[today] pos: [NN] ne: [TIME]
[The] pos: [DT] ne: [DATE]
[other] pos: [JJ] ne: [DATE]
[day] pos: [NN] ne: [DATE]
[I] pos: [CD] ne: [NUMBER]
[Killdeer] pos: [NNP] ne: [BISP]
[Killdeers] pos: [NNP] ne: [BISP]
[today] pos: [NN] ne: [TIME]
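
To show how output in this form could be consumed downstream, here is a minimal sketch that extracts the tokens carrying a given named-entity label. It assumes each entry has exactly the shape shown above and that BISP is the label used for bird species; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of a parser for classifier entries such as: [Osprey] pos: [NNP] ne: [BISP] */
class ClassifierOutputParser {
    // Group 1 is the token, group 2 the part-of-speech tag, group 3 the entity label.
    private static final Pattern ENTRY =
            Pattern.compile("\\[(.+?)\\] pos: \\[(\\S+?)\\] ne: \\[(\\S+?)\\]");

    /** Returns, in order of appearance, the tokens whose entity label equals the given label. */
    static List<String> tokensWithLabel(String output, String label) {
        List<String> tokens = new ArrayList<>();
        Matcher m = ENTRY.matcher(output);
        while (m.find()) {
            if (m.group(3).equals(label)) {
                tokens.add(m.group(1));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "[Osprey] pos: [NNP] ne: [BISP] [Washington] pos: [NNP] ne: [LOC]";
        System.out.println(tokensWithLabel(sample, "BISP"));
    }
}
```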

Data Persistence

All the collected data is filtered according to the needs of our application, and the results are stored as RDF. The same data model is used for both the information coming from Twitter and the information coming from the eBird API.

Figure 4. Application sequence diagram

Internal Data Models

The Tweet is the essential entity in our application; therefore, the communication between modules involves sending and receiving information about tweets and their owners. All the stored data will be in the RDF/XML format, aiming to model the necessary information about a tweet.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:vcard="http://www.w3.org/2001/vcard-rdf/3.0#"
         xmlns:bimr="http://xmlns.com/bimr#">
  <rdf:Description rdf:about="http://xmlns.com/bimr/user#132">
    <vcard:NAME>Markus Berg</vcard:NAME>
    <vcard:UID>131242</vcard:UID>
    <vcard:ADR>Ohio</vcard:ADR>
    <vcard:EMAIL>marksbrg@gmail.com</vcard:EMAIL>
    <vcard:NICKNAME>MarkB</vcard:NICKNAME>
    <bimr:hasGeoEnabled>true</bimr:hasGeoEnabled>
  </rdf:Description>
</rdf:RDF>
RDF/XML data format modelling a user
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:bimr="http://xmlns.com/bimr#"
         xmlns:tweet="http://xmlns.com/tweet#">
  <rdf:Description rdf:about="http://xmlns.com/bimr/tweet#32">
    <bimr:id>32</bimr:id>
    <tweet:language>en</tweet:language>
    <tweet:text>Eagle Friday! This pair of eagles are our observation today.</tweet:text>
    <tweet:link>https://tinyurl.com/214adfjkhv4a1</tweet:link>
    <tweet:author>Marcel Ron</tweet:author>
  </rdf:Description>
</rdf:RDF>
RDF/XML data format modelling a tweet
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:location="http://xmlns.com/location#">
  <rdf:Description rdf:about="http://xmlns.com/bimr/location#935a5197-4eef-40e6-957a-d2f90eaf2da9">
    <location:latitude>57.52</location:latitude>
    <location:longitude>12.81</location:longitude>
    <location:city>Ildaho</location:city>
    <location:state>Ohio</location:state>
    <location:country>USA</location:country>
  </rdf:Description>
</rdf:RDF>
RDF/XML data format modelling a location
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:bimr="http://xmlns.com/bimr#"
         xmlns:location="http://xmlns.com/location#"
         xmlns:tweet="http://xmlns.com/tweet#"
         xmlns:observation="http://xmlns.com/observation#">
  <rdf:Description rdf:about="http://xmlns.com/bimr/observation#a653e810-9b83-4f43-a21b-cfaa84e789e1">
    <observation:birdSpecies>eagles</observation:birdSpecies>
    <observation:birdSpecies>hawks</observation:birdSpecies>
    <observation:date>Fri Jan 05 00:00:00 EET 2018</observation:date>
    <observation:informationSourceId>twitter</observation:informationSourceId>
    <observation:location>
      <rdf:Description rdf:about="http://xmlns.com/bimr/location#935a5197-4eef-40e6-957a-d2f90eaf2da9">
        <location:latitude>57.52</location:latitude>
        <location:longitude>12.81</location:longitude>
        <location:city>Ildaho</location:city>
        <location:state>Ohio</location:state>
        <location:country>USA</location:country>
      </rdf:Description>
    </observation:location>
    <observation:tweet rdf:parseType="Resource">
      <tweet:tweetId>32</tweet:tweetId>
      <tweet:text>Eagle Friday! This pair of eagles and hawks are our observation today.</tweet:text>
    </observation:tweet>
  </rdf:Description>
</rdf:RDF>
RDF/XML data format modelling an observation
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:bimr="http://xmlns.com/bimr#"
         xmlns:vcard="http://www.w3.org/2001/vcard-rdf/3.0#"
         xmlns:location="http://xmlns.com/location#"
         xmlns:tweet="http://xmlns.com/tweet#"
         xmlns:observation="http://xmlns.com/observation#">
  <rdf:Description rdf:about="http://xmlns.com/bimr/hotspot#69003615-2335-4b59-930c-44ce1eb265d2">
    <rdf:Description rdf:about="http://xmlns.com/bimr/user#132">
      <vcard:NAME>Markus Berg</vcard:NAME>
      <vcard:UID>131242</vcard:UID>
      <vcard:ADR>Ohio</vcard:ADR>
      <vcard:EMAIL>marksbrg@gmail.com</vcard:EMAIL>
      <vcard:NICKNAME>MarkB</vcard:NICKNAME>
      <bimr:hasGeoEnabled>true</bimr:hasGeoEnabled>
    </rdf:Description>
    <rdf:Description rdf:about="http://xmlns.com/bimr/observation#a653e810-9b83-4f43-a21b-cfaa84e789e1">
      <observation:birdSpecies>eagles</observation:birdSpecies>
      <observation:birdSpecies>hawks</observation:birdSpecies>
      <observation:date>Fri Jan 05 00:00:00 EET 2018</observation:date>
      <observation:informationSourceId>twitter</observation:informationSourceId>
      <observation:location rdf:parseType="Resource">
        <location:longitude>-102.2</location:longitude>
        <location:latitude>80.21</location:latitude>
      </observation:location>
      <observation:tweet>
        <rdf:Description rdf:about="http://xmlns.com/bimr/tweet#32">
          <tweet:tweetId>32</tweet:tweetId>
          <tweet:text>Eagle Friday! This pair of eagles and hawks are our observation today.</tweet:text>
        </rdf:Description>
      </observation:tweet>
    </rdf:Description>
  </rdf:Description>
</rdf:RDF>
RDF/XML data format modelling a hotspot
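
As an illustration of how one of these models could be produced, the sketch below serializes a location to RDF/XML with a hand-written string builder. This is an assumption made for brevity: the report does not say how the application writes its RDF, and a real implementation would more likely rely on an RDF library. The class and method names are hypothetical; the namespaces are the ones used in the location model above.

```java
/** Sketch of a hand-written RDF/XML serializer for the location model. */
class LocationRdfWriter {
    private static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    private static final String LOCATION_NS = "http://xmlns.com/location#";
    private static final String LOCATION_BASE = "http://xmlns.com/bimr/location#";

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String value) {
        return value.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String toRdfXml(String id, double latitude, double longitude,
                           String city, String state, String country) {
        StringBuilder xml = new StringBuilder();
        xml.append("<rdf:RDF xmlns:rdf=\"").append(RDF_NS).append("\"\n")
           .append("         xmlns:location=\"").append(LOCATION_NS).append("\">\n")
           .append("  <rdf:Description rdf:about=\"")
           .append(LOCATION_BASE).append(escape(id)).append("\">\n");
        property(xml, "latitude", Double.toString(latitude));
        property(xml, "longitude", Double.toString(longitude));
        property(xml, "city", city);
        property(xml, "state", state);
        property(xml, "country", country);
        xml.append("  </rdf:Description>\n</rdf:RDF>\n");
        return xml.toString();
    }

    /** Appends one property element in the location namespace. */
    private static void property(StringBuilder xml, String name, String value) {
        xml.append("    <location:").append(name).append('>')
           .append(escape(value))
           .append("</location:").append(name).append(">\n");
    }

    public static void main(String[] args) {
        System.out.print(toRdfXml("935a5197-4eef-40e6-957a-d2f90eaf2da9",
                57.52, 12.81, "Ildaho", "Ohio", "USA"));
    }
}
```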

Data visualization

All the collected and processed data will be rendered on a world map with the help of the Google Maps API. The migration will be visualized in time frames: for every day, the application will display the locations where birds were spotted, so that over the course of several weeks it can show the migration in an animated manner. The website will also display information about the birds that were spotted, such as the species, the number of birds spotted at a location, and a short description of the species with a picture. Flocks of birds will be clustered by region. When the user zooms in, a larger cluster will be divided into several smaller clusters.
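
The zoom-dependent clustering described above can be sketched as a simple grid: sightings that fall into the same cell form one cluster, and the cell size halves with each zoom level, so zooming in splits large clusters into smaller ones. The grid rule, the class name and the sample points are assumptions for illustration; this is not the clustering performed by the Google Maps API.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of grid-based clustering whose cells shrink as the zoom level grows. */
class GridClusterer {
    /** Cell size in degrees; it halves with every zoom level (an assumed rule). */
    static double cellSize(int zoom) {
        return 360.0 / Math.pow(2, zoom);
    }

    /** Key of the grid cell containing the point, in the form "row:col". */
    static String cellKey(double latitude, double longitude, int zoom) {
        double size = cellSize(zoom);
        long row = (long) Math.floor((latitude + 90.0) / size);
        long col = (long) Math.floor((longitude + 180.0) / size);
        return row + ":" + col;
    }

    /** Counts sightings per cell; each point is given as {latitude, longitude}. */
    static Map<String, Integer> cluster(double[][] points, int zoom) {
        Map<String, Integer> counts = new HashMap<>();
        for (double[] point : points) {
            counts.merge(cellKey(point[0], point[1], zoom), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        double[][] sightings = {{57.52, 12.81}, {58.68, 7.17}, {47.16, 27.60}};
        System.out.println("zoom 0: " + cluster(sightings, 0));
        System.out.println("zoom 6: " + cluster(sightings, 6));
    }
}
```

At zoom 0 the three sample sightings share a single cell; at zoom 6 each falls into its own cell.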

As part of the visualization module, we will also compute statistics over time, once enough data has been collected. These statistics will show, for example, which species migrate from one area to another, which species are most present in a region, which migration routes certain species prefer each year, and how long they stay in one place (in months, for example).

GeoJSON is a format for encoding a variety of geographic data structures. We use it to pinpoint bird hotspots on the map so that we can further illustrate the migration phenomenon.
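
As a minimal sketch of the format, the code below builds a GeoJSON Point feature for one hotspot. Note that GeoJSON orders coordinates as [longitude, latitude]. The class name and the single "species" property are illustrative assumptions; the features produced by the application carry more properties.

```java
import java.util.Locale;

/** Sketch that builds a GeoJSON Point feature for a bird hotspot. */
class GeoJsonPoint {
    /** Escapes backslashes and double quotes so the value is safe inside a JSON string. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Returns the feature as compact JSON; coordinates are [longitude, latitude]. */
    static String feature(double latitude, double longitude, String species) {
        return String.format(Locale.ROOT,
                "{\"type\":\"Feature\","
                        + "\"geometry\":{\"type\":\"Point\",\"coordinates\":[%s,%s]},"
                        + "\"properties\":{\"species\":\"%s\"}}",
                longitude, latitude, escape(species));
    }

    public static void main(String[] args) {
        System.out.println(feature(57.52, 12.81, "Eagle"));
    }
}
```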

Figure 5. Birds that were spotted
Figure 6. Species Information

Open API

The application will provide a REST API that gives access to the data collected by the application. The API will define a set of functions that offer information about the data: where it was collected from, statistics about it, and the data itself. Developers send requests and receive responses over HTTP, using methods such as GET and POST. The responses will be in JSON format.

All the endpoints of the API are accessed with the GET method. Some examples of these endpoints:

  • /tweets
    • /ws/bimr/getAllTweets – returns all the tweets stored by the application;
  • /data
    • /ws/bimr/getAllMigrations – returns the data collected by the application in the last seven days;
    • /ws/bimr/getMigrationsByDate{startDate}/{endDate} – returns all the data collected by the application between the specified start and end dates;
  • /stats
    • /ws/bimr/getMostObservedSpecies – returns a list of all species ordered by the number of their appearances in tweets
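
The ordering behind an endpoint such as getMostObservedSpecies can be sketched as follows. The input (one species name per sighting) and the alphabetical tie-break are assumptions; the report does not specify how ties are handled or how the counts are stored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the "most observed species" statistic. */
class SpeciesStats {
    /** Returns species ordered by descending number of appearances; ties are alphabetical. */
    static List<String> mostObserved(List<String> sightings) {
        Map<String, Integer> counts = new HashMap<>();
        for (String species : sightings) {
            counts.merge(species, 1, Integer::sum);
        }
        List<String> ordered = new ArrayList<>(counts.keySet());
        ordered.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a)); // descending
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ordered;
    }

    public static void main(String[] args) {
        List<String> sightings =
                List.of("Eagle", "Osprey", "Eagle", "Killdeer", "Osprey", "Eagle");
        System.out.println(mostObserved(sightings));
    }
}
```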

The representation of a migration object returned by the Open API:


            {
              "features": [
                  {
                      "geometry": {
                          "coordinates": [
                              7.17,
                              58.68
                          ],
                          "type": "Point"
                      },
                      "type": "Feature",
                      "properties": {
                          "migrationId": "c557c9b3-3edf-493d-94ec-b2ad278cf8c5",
                          "toHotspot": {
                              "observationDate": {
                                  "year": 2018,
                                  "month": "JANUARY",
                                  "dayOfMonth": 5,
                                  "dayOfWeek": "FRIDAY",
                                  "dayOfYear": 5,
                                  "monthValue": 1,
                                  "nano": 123000000,
                                  "hour": 9,
                                  "minute": 30,
                                  "second": 10,
                                  "chronology": {
                                      "id": "ISO",
                                      "calendarType": "iso8601"
                                  }
                              },
                              "hotspotId": "4b9c5017-b016-44d0-9677-b71190a826d5",
                              "geometry": {
                                  "coordinates": [
                                      12.81,
                                      57.52
                                  ],
                                  "type": "Point"
                              },
                              "birdSpeciesList": [
                                  "Eagle"
                              ]
                          },
                          "species": "Eagle",
                          "toHotspotId": "4b9c5017-b016-44d0-9677-b71190a826d5",
                          "fromHotspot": {
                              "observationDate": {
                                  "year": 2018,
                                  "month": "JANUARY",
                                  "dayOfMonth": 3,
                                  "dayOfWeek": "WEDNESDAY",
                                  "dayOfYear": 3,
                                  "monthValue": 1,
                                  "nano": 123000000,
                                  "hour": 9,
                                  "minute": 30,
                                  "second": 10,
                                  "chronology": {
                                      "id": "ISO",
                                      "calendarType": "iso8601"
                                  }
                              },
                              "hotspotId": "356f4742-907d-4ae1-a02c-c1df2f5ddd09",
                              "geometry": {
                                  "coordinates": [
                                      7.17,
                                      58.68
                                  ],
                                  "type": "Point"
                              },
                              "birdSpeciesList": [
                                  "Eagle"
                              ]
                          },
                          "fromHotpsotId": "356f4742-907d-4ae1-a02c-c1df2f5ddd09"
                      }
                  }
              ]
            }
          

Results for non-functional testing (stress testing)

Apache JMeter™
The Apache JMeter™ application is open source software, a 100% pure Java application designed to load test functional behavior and measure performance. It was originally designed for testing Web Applications but has since expanded to other test functions.

Figure 7. Latency graph for 100 users accessing BiMr application
Figure 8. Latency graph for 200 users accessing BiMr application
Figure 9. Latency graph for 500 users accessing BiMr application

We can observe that the application handles 100 users requesting access at once without difficulty.

Almost the same holds when the BiMr server has to cope with 200 simultaneous requests from 200 users. There are some latency spikes, however, especially when the application runs the scheduled call to the Twitter API.

When it comes to handling 500 users at the same time, the application server can handle about one hundred requests before it starts to lag (latency spikes can be observed). Moreover, after about 300 requests the server enters a frozen state in which it cannot handle the remaining requests in the waiting queue.

These results show that, in its early phase, the application can handle a relatively small but significant number of users, and that future versions could handle requests more efficiently.

Graphs obtained by using Apache JMeter to stress test the application server

External data sources

Twitter Public Streaming API
The public streams offer samples of the public data flowing through Twitter. Once an application establishes a connection to a streaming endpoint, it is delivered a feed of Tweets, without needing to worry about polling or exceeding REST API quotas.

Twitter4J
An open source Java library used to integrate our application with Twitter services.

The Google Maps Geocoding API
Used for retrieving information about a place mentioned in a tweet, such as latitude and longitude.

{
    "formatted_address": "Iasi, Romania",
    "geometry": {
        "bounds": {
            "northeast": {
                "lat": 47.2274375,
                "lng": 27.6969839
            },
            "southwest": {
                "lat": 47.0848370999999,
                "lng": 27.4769569
            }
        },
        "location": {
            "lat": 47.1584549,
            "lng": 27.6014418
        },
        "location_type": "APPROXIMATE",
        "viewport": {
            "northeast": {
                "lat": 47.2274375,
                "lng": 27.6969839
            }
        }
    }
}
          
Response received from the Google Geocoding API for a specific city

Conclusion

The application will provide a real-time migration map of different bird species across the globe, combining live data in the form of tweets from Twitter with stable data from the eBird API. It is intended for people who love birds, who have bird spotting as a hobby, or who are simply curious about bird migration.
